Hierarchical Clustering Algorithm with Dynamic Tree Cut for Data Imputation
Abstract
Missing values are very common in real-world datasets for a variety of reasons. Deleting data points with missing values can negatively impact the performance of data analysis methods (e.g., machine learning, data mining). Using a human expert to restore the missing values is expensive and time consuming. The alternative is to impute the missing values during data preprocessing using the known values. This improves performance for data analysis, assuming the imputed values are correct. Unfortunately, imputation algorithms that use all the known values (e.g., mean imputation) often have considerable variance between the imputed and real values. More complex imputation algorithms (e.g., deck and model-based) choose a suitable subset of the data points for imputation. However, a weakness of these algorithms is that they use all the variables (i.e., attributes) for imputation even if some of the variables are uncorrelated. Here, we propose a framework called ClustFrame for imputation algorithms that chooses suitable subsets for both data points and variables. We also present a ClustImpute algorithm based on our framework that uses single imputation with (1) hierarchical clustering, (2) dynamic tree cut, and (3) a regression model to impute all missing values. Using nine datasets from the UCI repository and an empirically collected complex dataset, we evaluate our algorithm against several existing algorithms, including state-of-the-art model-based algorithms that use multiple imputation. Results show that ClustImpute achieves significantly higher imputation accuracy on many of the datasets. We conclude with some suggestions on improvements.
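The idea of choosing subsets of both variables and data points before fitting a regression can be illustrated with a minimal sketch. This is not the authors' ClustImpute algorithm: the abstract does not give its details, so every function name below is hypothetical, a fixed-height cut of a single-linkage tree stands in for the dynamic tree cut, and the variable subset is reduced to the single predictor most correlated with the target. The sketch also assumes that only the target column contains missing values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def fit_line(xs, ys):
    """Least-squares (slope, intercept) for y regressed on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def single_linkage(values, cut_height):
    """Agglomerative single-linkage clustering of 1-D values.

    Merging stops once the closest pair of clusters is farther apart than
    cut_height. This is a fixed-height cut, NOT the dynamic tree cut.
    Returns one cluster label per value.
    """
    clusters = [[i] for i in range(len(values))]

    def gap(a, b):
        return min(abs(values[i] - values[j]) for i in a for j in b)

    while len(clusters) > 1:
        d, i, j = min(
            (gap(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > cut_height:
            break
        clusters[i].extend(clusters.pop(j))
    labels = [0] * len(values)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def impute_column(rows, target, cut_height):
    """Return a copy of rows with None in column `target` filled in.

    Assumption: only the target column contains missing values (None).
    """
    complete = [r for r in rows if r[target] is not None]
    y = [r[target] for r in complete]
    # Variable subset: keep only the predictor most correlated with the target.
    others = [v for v in range(len(rows[0])) if v != target]
    best = max(others, key=lambda v: abs(pearson([r[v] for r in complete], y)))
    # Data point subset: cluster all rows on the chosen predictor.
    labels = single_linkage([r[best] for r in rows], cut_height)
    filled = [list(r) for r in rows]
    for k, row in enumerate(rows):
        if row[target] is not None:
            continue
        donors = [r for r, lab in zip(rows, labels)
                  if lab == labels[k] and r[target] is not None]
        if len(donors) < 2:  # too few neighbours: fall back to all complete rows
            donors = complete
        slope, intercept = fit_line([r[best] for r in donors],
                                    [r[target] for r in donors])
        filled[k][target] = slope * row[best] + intercept
    return filled


# Toy data: column 0 predicts column 2 differently in two groups; column 1 is noise.
rows = [
    [1.0, 5.0, 2.0], [2.0, 1.0, 4.0], [3.0, 9.0, None], [4.0, 5.0, 8.0],
    [101.0, 1.0, 99.0], [102.0, 5.0, 98.0], [103.0, 1.0, 97.0], [104.0, 5.0, 96.0],
]
print(impute_column(rows, target=2, cut_height=10.0)[2])
```

In this toy data the missing value sits in a group where column 2 is twice column 0, so the regression fitted on that cluster alone gives an imputed value close to 6.0. A single regression over all rows would instead be dominated by the jump between the two groups.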
Similar resources
Dynamic Tree Cut: in-depth description, tests and applications
In hierarchical clustering, clusters are defined as branches of a cluster tree. The constant height branch cut, a commonly used method to identify branches of a cluster tree, is not ideal for cluster identification in complicated dendrograms. We describe a new dynamic branch cutting approach for detecting clusters in a cluster tree based on their shape. Compared to the constant height cutoff, o...
Graph Clustering by Hierarchical Singular Value Decomposition with Selectable Range for Number of Clusters Members
Graphs have many applications in real-world problems. When dealing with huge volumes of data, analysis is difficult or sometimes impossible. In big data problems, clustering is a useful tool for data analysis. Singular value decomposition (SVD) is one of the best algorithms for graph clustering, but it offers no way to select the number of clusters and the number of members ...
A Novel Hybrid Clustering Method Using Artificial Immune Systems and Hierarchical Clustering
The artificial immune system (AIS) is a meta-heuristic algorithm for solving complex problems. With large amounts of data, making rapid decisions and producing stable results are the most challenging tasks because of rapid variation in the real world. Clustering is a possible solution for overcoming these problems. The goal of cluster analysis is to group similar objects. AIS algor...
Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R
SUMMARY Hierarchical clustering is a widely used method for detecting clusters in genomic data. Clusters are defined by cutting branches off the dendrogram. A common but inflexible method uses a constant height cutoff value; this method exhibits suboptimal performance on complicated dendrograms. We present the Dynamic Tree Cut R package that implements novel dynamic branch cutting methods for d...
Missing data imputation in multivariable time series data
Multivariate time series data are found in a variety of fields such as bioinformatics, biology, genetics, astronomy, geography and finance. Many time series datasets contain missing data. Imputing missing data in multivariate time series is a challenging topic and needs to be carefully considered before learning or predicting time series. Much research has been done on the use of diffe...
Publication date: 2016